[SOUND] Hello, everyone. Welcome back to the Heterogeneous Parallel Programming class. This is Lecture 1.3, Portability and Scalability in Heterogeneous Parallel Computing.

The objective of this lecture is to help you understand the importance and nature of scalability and portability in parallel programming.

This slide shows data published by IBM in 2010. The graph shows that hardware costs and software costs have both been growing exponentially over the years. This is a log plot, so a straight line in this plot is an exponential curve. Software cost is measured in lines of code per chip, and it has been doubling every ten months. Hardware cost is measured in gates per chip, and it has been doubling every 18 months. So software cost has been growing much faster than hardware cost. Even though software cost started out lower than hardware cost, after 2010 it has essentially exceeded hardware cost, and it has been growing faster. So software cost is going to be much, much higher than hardware cost in the years to come.
So in the future, all systems need to be able to minimize software development cost, or redevelopment cost, and this leads to the consideration of scalability and portability.

The first aspect of software cost control is scalability. If we develop an application to run well on Core A, what we would like to do is make sure that the same application, without significant redevelopment, can run efficiently on the next version of Core A, let's say Core A 2.0. This allows us to use the same application when a new generation of hardware is introduced. Whenever scalability holds, the developer does not need to revise the software in order to run well on the new generation of hardware, therefore reducing the redevelopment cost.

There is another dimension of scalability. Whenever an application runs well on one Core A, we would also like it to run well on multiples of these Core As, that is, on more of the same cores. This allows us to add performance by adding more hardware to the system. In many situations, vendors would like to introduce multiple versions of hardware, and each version will have increasingly more hardware available to the users.
So if we can develop a piece of software that is scalable, that runs well on more of the same cores, this gives the vendor scalability for their hardware, so that when they introduce more hardware, the users can actually observe increased performance from the application.

In the future, we expect that there will be several generations of hardware where performance will be increased by adjusting many parameters. For example: the number of compute units or cores, the number of threads, increased vector length, increased pipeline depth, increased DRAM burst size, an increased number of DRAM channels, and changes in data movement latency. All of these hardware parameters can significantly affect the performance of an application, and oftentimes applications need to be tuned to particular settings of these parameters. The programming style that we use in this course addresses these needs by supporting fine-grained problem decomposition and dynamic thread scheduling, so that an application you write in this programming style will be able to automatically adjust to a fairly wide range of parameter values that the hardware vendors may change.
So this allows your application to run well on one generation of hardware and continue to run well on a future generation of hardware. And also, if your application runs well on one of the cores, you can expect the application to also run well on more of the same cores.

The second dimension of software cost control is portability. Portability means that if we develop an application to run well on Core A, we would also like it to be able to run well on different types of cores, in this case Core B and Core C in the picture. Oftentimes, the application developed for Core A may initially be running on one vendor's product. But if the application is portable, then the users can expect the same application to also run well on different hardware types, oftentimes from different hardware vendors. This can also decrease the software cost, because the developer will not need to redevelop or revise their application so that it can run well on other vendors' systems.

And a lot of times we will see different design styles from different vendors. For GPUs in particular, we often see significantly different design styles, as illustrated in this picture.
In terms of the particular kinds of differences, for CPU cores we would see different instruction set architectures, such as x86 versus ARM versus other instruction set architectures. These instruction set architectures oftentimes require different compiler code generation and so on, and that oftentimes will affect the portability of your application. We also have different design styles even based on the same instruction set architecture: we could have latency-oriented CPU designs versus throughput-oriented GPU designs. So when we develop a piece of code, can we expect it to run well on a latency-oriented CPU if it runs well on a throughput-oriented GPU? The third dimension of portability often comes from different styles of parallelism in the processor core. There is a design style called VLIW, there is a design style called SIMD, and there is a design style called multithreading. We are going to touch quite a bit on this level of hardware differences. And then it also comes down to how we organize DRAM in the system, whether it is a shared memory model or a distributed memory model.
All of these dimensions affect the portability of your application. So as we work toward the end of the course, we will introduce emerging standards such as OpenCL and the Heterogeneous System Architecture that will help to address the portability of your applications.

At this point, we have completed all the high-level introduction and context information lectures. Starting from the next lecture, we are going to be introducing the CUDA programming interface and begin to help you develop your lab application assignments. For those of you who would like to learn more about the topic and the context information, I would like to encourage you to read Chapter 1 of the textbook. Thank you.